“To what extent can the price of a laptop be predicted based on some of its specifications”
We can use this model as a way for people to identify what laptop type they are looking to purchase based on the series of specifications they want. Thus, the initial question we were looking to investigate was:To what degree does the existence of various specifications influence price
Our selected data sets contain information on 1,196 laptops with 23 variables. “Laptop_data_final” is a merged dataset of two separate collections, cleaned to limit the laptops to the following specifications which we deemed as relevant in our research:
Company: The manufacturer of the laptop, which could influence pricing based on brand reputation and target market.
TypeName: The category of the laptop (i.e. 2 in 1 Convertible, Netbook, Workstation, Ultrabook, Notebook, Gaming). Different types cater to specific user needs, which likely impacts pricing, specifications offered, and weight.
Ram: RAM size in GB is a critical performance factor. High RAM capacities (e.g., 16GB, 32GB) are often associated with higher prices.
Weight: Indicates portability, an important aspect of laptop design. Lighter laptops (e.g. Ultrabooks) may command premium prices.
Log_price: The target variable for our analysis, represents the laptop’s cost. We plan to understand how other variables influence it. However, this number was represented as the natural log of the price of laptops in Rupees. Initially, we believe this was done to present a more linear graph to represent the price. Due to this, we produced a new variable titled “price” and converted this to pounds in our initial data cleaning to demonstrate a more understandable and relevant variable.
TouchScreen: Whether a laptop has a touchscreen feature, a potential indicator of modern or premium models and is usually charged at a higher rate.
Ips: The presence of an IPS display, which typically provides better color accuracy and viewing angles, is often seen in higher-priced laptops.
Ppi (Pixels Per Inch): Indicates display resolution and clarity. Higher PPI is often correlated with premium displays.
Cpu_brand: The processor brand/model (e.g., Intel Core i5, AMD Ryzen). CPU type directly impacts the laptop’s performance and price.
HDD: Capacity of traditional hard drives in GB. Larger HDDs might affect price, though SSDs are now more common in modern laptops.
SSD: Capacity of solid-state drives in GB. SSDs are faster and more expensive than HDDs, significantly impacting pricing.
Storage_type: An additional variable created by us to determine whether the laptop possessed either HDD or SSD
Gpu_brand: Graphics card brand (e.g., Intel, NVIDIA, AMD). Gaming laptops or those with high-end GPUs are likely more expensive.
Os: The operating system installed (e.g., Mac, Windows, Linux, Others). Mac laptops, for instance, are typically priced higher.
Inches: Decribes the length of one corner of the laptop to the other generally adjusted to suit the laptop type.
Storage capacity: represents the total storage space available in each laptop, measured in gigabytes (GB).
GHz: represents the processor clock speed of laptops, measured in gigahertz (GHz). This is an indicator of the CPU’s performance, with higher GHz values generally corresponding to faster processing speeds.
This dataset thus allows us to explore the relationships between a laptop’s specifications, the company selling the laptop, and their prices.
This report presents findings from the analysis of two laptop datasets -one uncleaned and the other cleaned - available on Kaggle to identify patterns and factors affecting price. The dataset contains the information of 1,196 laptops with 23 specifications, such as Company, RAM, Weight and Storage type. Initially, the objective of this analysis was to create a model to aid people in identifying the laptop type and company they should likely purchase from, based on a series of specifications they would like. Upon further exploratory analysis, we limited our scope to an exploration of how price is influenced by the presence of such specifications. The following sections outline our methods, results and evaluation of the findings as well as some issues we faced and how they were overcome.
It must first be acknowledged that this analysis is subject to certain limitations and should be interpreted with caution. The scope of the laptop statistics was constrained to the companies of Acer, Asus, Dell, HP, Lenovo and Other; categorised into one of the following categories: 2 in 1 Convertible, Netbook, Workstation, Ultrabook, Notebook or Gaming. Thus, while the analysis provides valuable insights, it is not exhaustive and may not capture all potential variables or nuances relevant to the study. Any assumptions made have been mentioned in the ‘data cleaning and wrangling’ and ‘exploratory data analysis’ sections.
First, the two datasets were merged and any duplicate variables were subsequently removed. From the HDD and SSD variables, we created a variable denoting the total amount of storage in the laptop, and subsequently removed any observations with a storage capacity of 0 as this is not possible; such was one of the assumptions we made. Furthermore, we created a variable that denoted whether a laptop had SSD storage, HDD storage or both called “storage_type”.
At this point, we surmised it would be best to limit the company variable (describing the company from which the laptop is purchased) to only the companies with over 100 laptops, all other companies were then denoted as “other”, this was done to simplify the model as there was a large number of companies with few entries:
| Company | n |
|---|---|
| Huawei | 2 |
| Fujitsu | 3 |
| 3 | |
| LG | 3 |
| Xiaomi | 4 |
| Mediacom | 5 |
| Microsoft | 6 |
| Razer | 7 |
| Samsung | 7 |
| Apple | 11 |
| Toshiba | 48 |
| MSI | 54 |
| Acer | 88 |
| Asus | 137 |
| HP | 260 |
| Lenovo | 270 |
| Dell | 288 |
The price variable was given originally as log transformed and in rupees. We converted Price back into pounds and also decided to log transform the price in pounds as the data was skewed rightward:
Using log price also allowed us to avoid the fan-shaped residual
plot:
Finally we were also able to extract the clock speed of the processor from the uncleaned dataset and store it as a numeric variable.
Our dataset contained no missing values, and so we created various visualisations to improve our understanding of relations between price and other variables as well as variation of numeric variables in relation to certain character variables such as computer type and company.
Here is a correlation matrix of these - As we can see from the graph, Price has its highest correlation with RAM which was largely expected, and the amount of SSD storage which was more surprising as both the total storage amount and HDD both have a low correlation
This positive correlation was expected, because the RAM is a major specification in laptops. The points are gathered in vertical lines because RAM (and most storage) is usually in powers of two (e.g 256, 512).
SSD shows a similar trend to RAM, and also has the large groups around powers of two again, as expected for computer storage.
The above 3d plot shows the relationship between price and both the numeric values at the same time.
As we can see in the visualisation above processor speed for certain GHz values tends to be always slightly lower or higher than fitted line, revealing that clockspeed fails to entirely capture the processor performance. We did not try to consider Ghz as categorical because there are too many values with very little entries and also the dispersion of the values is often quite large.
Building on what we’ve learned we then created multiple models to evaluate which variables yield such improvement for the model that they are worth keeping or they should not be considered.
With all this information we then created a model predicting price using the aforementioned numerical and categorical variables.
## # A tibble: 21 × 3
## term estimate exp_estimate
## <chr> <dbl> <dbl>
## 1 (Intercept) 5.48 239.
## 2 Ram 0.0266 1.03
## 3 SSD 0.0008 1.00
## 4 TypeNameGaming 0.0991 1.10
## 5 TypeNameNetbook 0.0463 1.05
## 6 TypeNameNotebook -0.164 0.849
## 7 TypeNameUltrabook 0.101 1.11
## 8 TypeNameWorkstation 0.493 1.64
## 9 OsOthers -0.466 0.627
## 10 OsWindows -0.163 0.849
## # ℹ 11 more rows
This can be represented as a linear equation:
\[ \widehat{logPrice} = 5.48 + (0.0266 \times Ram) + (0.0008 \times SSD) + \dots \]
As all categorical variables have been converted into dummy variables, our model can be interpreted as a straight line with a gradient based on RAM and SSD and with an intercept that changes depending on the categorical variables.
To interpret the slope, we substitute each estimate into \(e^x\) to generate a column called
exp_estimate.
We can then say that all else held constant that for every additional GB of Ram, the estimated price increases on average by a factor of 1.03 and for every additional GB of SSD storage, the estimated price increases on average by a factor of 1.0008. Furthermore, to interpret the intercept we can also say that if a laptop is a gaming laptop then the estimated price increases by a factor of 1.1042 compared to the baseline of 2-in-1 convertible, and that if the laptop is a notebook the price decreases by a factor of 0.8492 compared to a 2-in-1 convertible.
It is worth noting that the baseline intercept price of £268.52 does not make sense in the context of the data as a laptop cannot have 0 RAM.
To evaluate the model we calculated the RMSE value and the adjusted R^2 value to measure the accuracy and variability within our model, for this we found RMSE as 0.286 and adj R^2 as 0.717. We can interpret these results as meaning that on average our model is off by around 28.6% and that our model can explain around 70% of the variability in the price.
To conclude, the price of a laptop can be predicted to be within a reasonable range of uncertainty, and the model would inform you if you are being massively over or undercharged with decent certainty. Furthermore, generally the variables that tend to influence price the most are variables linked to a higher processing power, not so much the quality of life features.
However, there are several limitations to consider within this report. Firstly, as previously mentioned, the scope of our dataset was limited to only 1,196 laptops and 23 specifications - which we further cut down - explaining the imperfect RMSE and adjusted R^2 values. This means the limited sample size will not fully represent the broader context of all laptops ever. Additionally, linking to this topic is the issue of the rapidly progressing technological market which means any and all research done on laptop types will be limited to the time frame in which the exploration was conducted. For example, new types of storage could be introduced which would consequently reduce the price of laptops with the more dated storage specifications. Finally, the laptop prices, in pounds, were converted from Indian rupees. This means the exchange rate used is bound to become more and more outdated as time goes on as the value of both currencies vary according to each country’s economic status.
Moving forward, we would address these limitations by collecting additional data with a focus on the UK to remove the need for currency exchange conversions, which will also help in enhancing analytical precision. Future analyses could also incorporate investigations of the relationship between RAM, SSD, and any other storage specification. Perhaps this would help increase the precision and sensitivity of our model. We could also consider setting RAM as a categorical variable, since RAM is largely discrete values.
Kushwaha, G. P. (2023, June 15). Laptop price prediction cleaned dataset. 2024 November 13, Kaggle. https://www.kaggle.com/datasets/gyanprakashkushwaha/laptop-price-prediction-cleaned-dataset
R, J. D. (2023, November 30). Laptop price prediction dataset. 2024 November 13, Kaggle. https://www.kaggle.com/datasets/jacksondivakarr/laptop-price-prediction-dataset